Processor array including delay elements associated with primary bus nodes

ABSTRACT

There is disclosed a processor array, which achieves an approximately constant latency. Communications to and from the farthest array elements are suitably pipelined for the distance, while communications to and from closer array elements are deliberately “over-pipelined” such that the latency to all end-point elements is the same number of clock cycles. The processor array has a plurality of primary buses, each connected to a primary bus driver, and each having a respective plurality of primary bus nodes thereon; respective pluralities of secondary buses, connected to said primary bus nodes; a plurality of processor elements, each connected to one of the secondary buses; and delay elements associated with the primary bus nodes, for delaying communications with processor elements connected to different ones of the secondary buses by different amounts, in order to achieve a degree of synchronization between operation of said processor elements.

BACKGROUND

This invention relates to a processor array, and in particular to a large processor array which requires multi-bit, bidirectional, high-bandwidth communication to one processor at a time, to all the processors at the same time, or to a subset of the processors at the same time. This communication might be needed for data transfer, such as loading a program into a processor or reading back status or result information from a processor, or for control of the processor array, such as the synchronous starting, stopping or single-stepping of the individual processors.

GB-A-2370380 describes a large processor array in which each processor (array element) needs to store the instructions which make up an operating program, and then needs to be controllable so that it runs the operating program as desired. Since the array elements pass data from one to another, it is essential that the processors are at least approximately synchronised. Therefore, they must be started (i.e. commence running their programs) at the same time. Likewise, if they are to be stopped at some time and then re-started, they need to be stopped at the same time.

Due to the large number of array elements, and the relatively large size of their instruction stores, data stores, register files and so on, it is advantageous to be able to load the program for each array element quickly.

Due to the size of the processor array, it is difficult to minimise the amount of clock skew between array elements and, in fact, it is advantageous from the point of view of supplying power to the array elements to have a certain amount of clock skew. Nevertheless, the array elements must be synchronised to within about one clock cycle of each other.

For synchronous control of an array of processors, the simplest solution would be to wire the control signals to all array elements in a parallel fan-out. This has the limitation of becoming unwieldy once the array is larger than a certain size. Once the signals take longer than one clock cycle to reach the most distant array elements, it becomes difficult to pipeline the control signals efficiently and to balance the end-point arrival times over all operating conditions. This imposes an upper limit on the clock speed that can be used, and hence on the bandwidth of communications. Additionally, this approach is not well suited to talking to just one processor at a time in one mode and then to all processors at once in another mode.

For high-bandwidth communications to multiple end-points, packet-switched or circuit-switched networks are a good solution. However, this approach has the disadvantage of not generally being synchronous at all the end-points: the latency to end-points further away is longer than to end-points that are close. It also requires the nodes of the network to be quite intelligent, and hence complex.

It is also necessary to consider the issue of scalability. A design that works well in one processor array may have to be completely redesigned for a slightly larger array, and may be relatively inefficient for a smaller array.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block schematic diagram of a processor array according to the present invention.

FIG. 2 is a block schematic diagram of a first embodiment of a primary node in the array of FIG. 1.

FIG. 3 is a block schematic diagram of a second embodiment of a primary node in the array of FIG. 1.

FIG. 4 is a block schematic diagram of a secondary node in the array of FIG. 1.

FIG. 5 is a more detailed block schematic diagram of a part of the array of FIG. 1.

FIG. 6 is a more detailed block schematic diagram of a second part of the array of FIG. 1.

FIGS. 7 and 8 show parts of the array of FIG. 1, in use.

DETAILED DESCRIPTION

FIG. 1 shows an array of processors 4, which are all connected to a column driver 1 over buses 5. As illustrated, the array is made up of horizontal rows and vertical columns of array elements 4, although the actual physical positions of the array elements are unimportant for this invention. Each row of array elements has been divided into sub-groups 6. The array elements 4 within one sub-group 6 are connected together on a horizontal bus segment 7 via respective row nodes 3. The horizontal bus segments 7 are connected to vertical buses 8 via respective column nodes 2. Each sub-group 6 contains array elements with which the column node 2 can easily communicate within a single clock cycle. Thus, the vertical buses 8 act as primary buses, the column nodes 2 act as primary bus nodes, the horizontal bus segments 7 act as secondary buses, and the row nodes 3 act as secondary bus nodes.

Each vertical bus 8 is driven individually by the column driver 1, as will be described in more detail below with reference to FIG. 6. This serves as part of the communication routing and as a means of conserving power.

Each of the buses 5, 7, 8 is in fact a pair of unidirectional, multi-bit buses, one in each direction, although each is shown as a single line for clarity.

The column nodes 2 take two different forms, shown in FIGS. 2 and 3 respectively. FIG. 2 shows a column node without a vertical pipeline stage, while FIG. 3 shows a column node with a vertical pipeline stage.

In the column node 10 shown in FIG. 2, the outgoing part of the vertical bus 8, carrying data from the column driver 1, propagates straight through the node 10 from an inlet 12 to an outlet 15. It is also tapped off, at a connection 26, to a further bus 25. The bus 25 is connected to an outgoing part 13 of the horizontal bus segment 7 via a short, tapped delay line 18. The tapped delay line 18 allows the signal to the horizontal bus segment 7 to be delayed by a predetermined integer number of clock cycles. The return path part 14 of the horizontal bus segment 7, carrying data to the column driver 1, is also passed through a short, tapped delay line 19 to connect to a bus 20. The delay line 19 delays the return signal by a predetermined integer number of clock cycles. The delay in the delay line 19 is preferably the same as the delay in the delay line 18, although it could be chosen to be different, provided that the delays in the different nodes 10 are set so that there is the same total delay when sending signals to all end points, and when receiving signals from all end points. The bus 20 is combined with the return path of the vertical bus received at an input 16 in a bitwise, logical OR function 17 to form a return path vertical bus signal for output 11.
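As an illustration only, the behaviour of a tapped delay line such as 18 or 19 can be modelled at the clock-cycle level. This is a minimal Python sketch: the class name `TappedDelayLine` is hypothetical, and it assumes the delay stages reset to all-zeros, i.e. an idle bus.

```python
from collections import deque


class TappedDelayLine:
    """Cycle-level model of a tapped delay line such as 18 or 19.

    Each call to clock() represents one clock edge: a word enters, and the
    word that entered delay_cycles edges earlier comes out.
    """

    def __init__(self, delay_cycles: int) -> None:
        if delay_cycles < 0:
            raise ValueError("delay must be a non-negative whole number of cycles")
        # Assumption: the stages reset to all-zeros, i.e. an idle bus.
        self.stages = deque([0] * delay_cycles)

    def clock(self, word_in: int) -> int:
        if not self.stages:
            return word_in  # D = 0: the tap passes the word straight through
        self.stages.append(word_in)
        return self.stages.popleft()
```

Using two instances with the same delay on the outgoing path (bus 25 to segment 13) and the return path (segment 14 to bus 20) gives the symmetric arrangement described above as preferable.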

FIG. 3 shows an alternative form of column node 22. Features of the column node 22 which have the same functions as features of the column node 10 shown in FIG. 2 are indicated by the same reference numerals, and will not be described again below. Compared with the column node 10, the column node 22 has a vertical pipeline stage. Thus, there is a pipeline register 23 inserted in the outgoing part of the vertical bus 8, which delays outgoing signals by one clock cycle, and a pipeline register 24 inserted in the return path of the vertical bus 8, which similarly delays return signals by one clock cycle.

Both types of column node 10, 22 provide a junction between the vertical bus 8 and the horizontal bus segments 7 for the sub-groups 6 of array elements 4. The column node 22, which has a vertical pipeline stage, allows the total vertical bus path to take longer than a single clock cycle. The column node 10 without the vertical pipeline stage allows the junction to be provided without adding a pipeline stage to the vertical bus path. Using the two types of column node in conjunction with each other, as described in more detail below, allows sufficient pipelining to enable high-bandwidth communications without having to reduce the clock speed, but without an unnecessarily large, and hence inefficient, total pipeline depth.

FIG. 4 shows in more detail a row node 3 of the type shown in FIG. 1. The outgoing part of the horizontal bus 7, carrying data from the column driver 1, propagates straight through the node from an inlet 51 to an outlet 53. It is also tapped off, at a connection 50, to a further bus 60. The bus 60 is connected to an array element interface 57.

The array element interface 57 connects to one of the array elements, as shown in FIG. 1, via buses 55 and 56. The array element interface 57 interprets the bus protocol to determine whether received communications are intended for the specific array element connected to this row node. Information which is read back from the array element connected to this row node 3 is received in the interface 57 and output on a bus 59. A return path part of the horizontal bus 7, carrying data towards the column driver 1, is received at an input 54. The bus 59 is combined with the return path of the horizontal bus 7 in a bitwise, logical OR function 58 to form a return path horizontal bus signal for output 52.

FIG. 5 shows in more detail one of the row sub-groups 6 shown in FIG. 1. In this illustrated example, the sub-group 6 contains four array elements 4, although there may be more or fewer than four elements in a sub-group, depending upon the number of elements with which the column driver 1 can communicate effectively in a single clock cycle. The four array elements 4 are connected to the horizontal bus 7 via respective row nodes 3. Data is received on the outgoing horizontal bus segment 13 (shown in FIGS. 2 and 3), and output on the return path part 14 (also shown in FIGS. 2 and 3) of the horizontal bus segment 7. The outgoing horizontal bus segment is left unconnected at the far end 62. The return path horizontal bus segment is terminated with logical all-zeros, or grounded, at its far end 63. This is to avoid corruption of any return path data, which may be logically ORed onto the bus 7 via any of the row nodes 3.

FIG. 6 shows in more detail the column driver 1 from FIG. 1. In this illustrated example, the number of columns is four, but the number of columns could be more or fewer than four. Outgoing data for the array elements 4 is received from an array control processor (not shown) on a bus 31, which is wired in parallel to the outgoing parts 33, 35, 37, 39 of each of the four vertical buses connected to the respective columns.

The bus 31 is connected to the outgoing parts 33, 35, 37, 39 via respective bitwise, logical AND functions 43. The logical AND functions 43 also receive enabling signals 44 from a protocol snooping block 42. The protocol snooping block 42 watches the communications on the bus 31 and, based on the address signals amongst the data, generates enabling signals which enable each column individually or all together as appropriate.
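The gating performed by the AND functions 43 can be modelled in a few lines of Python. This is a sketch only: the snooping block's actual address decode is not specified here, so the hypothetical `snoop_enables` simply takes an already-decoded column index (or `None` for a broadcast) as its input, and `gate_columns` is likewise a hypothetical name.

```python
from typing import List, Optional

ACTIVE_BITS = (1 << 20) - 1  # 4 flags + 16-bit payload on the outgoing bus


def snoop_enables(target_column: Optional[int], num_columns: int) -> List[bool]:
    """Hypothetical snooper output: enable one column, or all for a broadcast."""
    if target_column is None:
        return [True] * num_columns
    return [col == target_column for col in range(num_columns)]


def gate_columns(word: int, enables: List[bool]) -> List[int]:
    """AND the shared input word with each column's enable (AND functions 43).

    A disabled column's outgoing bus stays at all-zeros, so no activity
    propagates down it; this is how the switching helps conserve power.
    """
    return [word & (ACTIVE_BITS if enabled else 0) for enabled in enables]
```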

The return path parts 34, 36, 38, 40 of each of the four vertical buses connected to the respective columns are combined in a bitwise, logical OR function 41 to generate the overall return path bus 32, which transfers data from the array elements 4 to the array control processor.

As shown in FIG. 6, the column driver connects to four columns. However, when the array contains a large number of elements 4, and/or the sub-groups 6 only contain small numbers of elements 4, the number of columns may become large. In that case, additional pipeline stages may be required to ensure that the delays to and from all of the end points remain the same. For example, additional pipeline registers could be provided in one or more of the branches 33-40, and/or at one or more of the inputs to the OR gate 41, and/or at the inputs to one or more of the AND gates 43.

The overall outgoing path bus is therefore a simple parallel connection with the addition of some pipeline stages and some high-level switching. The high-level switching performs part of the array element addressing function, and helps to conserve power.

The overall return path bus is a simple logical OR fan-in with the addition of some pipeline stages. No arbitration is necessary, because the latency of the bus is constant, because the array control processor will only read from one array element at a time, and because array elements that are not being addressed transmit logical all-zeros onto the bus.
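The arbitration-free return path can be modelled as nested OR fan-ins. This Python sketch covers only the combining logic and ignores timing, on the basis that the equal-latency design makes all contributions to one read arrive in the same cycle; the function names and the nested-list layout are hypothetical.

```python
from functools import reduce
from operator import or_
from typing import Iterable, List


def or_fan_in(words: Iterable[int]) -> int:
    """Bitwise-OR return-path words together, starting from all-zeros.

    The initial 0 corresponds to the grounded far end of a bus segment.
    """
    return reduce(or_, words, 0)


def array_return(columns: List[List[List[int]]]) -> int:
    """columns[c][s][e] is the word driven by element e of sub-group s in column c.

    Row nodes OR onto horizontal segments (OR 58), column nodes OR onto the
    vertical bus (OR 17), and the column driver ORs the columns (OR 41).
    """
    return or_fan_in(
        or_fan_in(or_fan_in(sub_group) for sub_group in column)
        for column in columns
    )
```

Because every non-addressed element drives all-zeros, the result equals the addressed element's word, with no need to select between sources.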

This still allows tight pipelining of read accesses and avoids the use of tri-state buses.

FIGS. 7 and 8 show two possible arrangements of column nodes. In both of these arrangements, the column nodes which are nearer to the column bus driver introduce longer delays, by way of their tapped delay lines, than the column nodes which are further from the column bus driver.

In FIG. 7, the column node which is closest to the column bus driver is a node 22 with a vertical pipeline stage, as shown in FIG. 3 and represented in FIG. 7 by a solid circle, and every fourth column node thereafter also has a vertical pipeline stage, while the other column nodes are nodes 10 which do not have a vertical pipeline stage, as shown in FIG. 2 and represented in FIG. 7 by an open circle. In FIG. 8, the column node which is closest to the column bus driver is a node 22 with a vertical pipeline stage, as shown in FIG. 3 and represented in FIG. 8 by a solid circle, and every third column node thereafter also has a vertical pipeline stage, while the other column nodes are nodes 10 which do not have a vertical pipeline stage, as shown in FIG. 2 and represented in FIG. 8 by an open circle. The actual spacing of the nodes with a vertical pipeline stage would depend upon the physical implementation. The spacing should be chosen in order to use the minimum number of nodes with vertical pipeline stages whilst maintaining correct operation of the bus over all operating conditions. The nodes with vertical pipeline stages may be regularly or irregularly spaced, as required. This illustrates the scalability of this approach, since all that changes is the overall latency, not the bandwidth.

FIGS. 7 and 8 also illustrate exemplary configurations of the tapped delay lines 18, 19 in each column node. In FIG. 7, starting at the column node which is most distant from the column bus driver 1, namely the node 74, the tapped delay lines 18, 19 have a delay time, D, which is set to the minimum delay time, namely 0 clock cycles in this example. The delay times are then allocated by moving up the column and incrementing the delay time by 1 clock cycle each time a pipelined node 22 is passed. Thus, in FIG. 7, the tapped delay lines 18, 19 in the pipelined node 75 still have a delay time D=0, since the horizontal branch in this node is after the pipeline registers 23, 24. The next node 76 is configured with the tapped delay lines 18, 19 having a delay time D=1 clock cycle.

This process can be repeated until the column node nearest the column bus driver is reached. Thus, all of the end-points on the horizontal bus segments have the same latency to and from the top of the column.

A similar pattern of tapped delay line configuration can be seen in FIG. 8. Thus, in the column node 78 which is most distant from the column bus driver 1, the tapped delay lines 18, 19 have a delay time, D, which is set to the minimum delay time, namely 0 clock cycles in this example. Again, the delay times are allocated by moving up the column and incrementing the delay time by 1 clock cycle each time a pipelined node 22 is passed. Thus, in FIG. 8, the node 79 is configured with the tapped delay lines 18, 19 having a delay time D=1 clock cycle. This process can be repeated until the column node nearest the column bus driver is reached.

When the delay time of a tapped delay line 18 is set to 0 clock cycles, the end points connected to that tapped delay line are in effect being driven by the preceding vertical bus pipeline register 23. This may increase the loading on the pipeline register excessively. Therefore, in practice, the minimum delay time in the tapped delay lines 18, 19 may be chosen as 1 clock cycle, rather than 0, in order to reduce this loading.
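The allocation rule for the tapped delay lines can be checked with a short Python sketch. It assumes, as described for FIG. 7, that the horizontal branch of a pipelined node 22 is taken after its registers 23, 24, and it counts only register and delay-line cycles (wire delay within a cycle is ignored). The function names are hypothetical; nodes are listed nearest-to-driver first.

```python
from typing import List, Sequence


def allocate_delays(pipelined: Sequence[bool], min_delay: int = 0) -> List[int]:
    """Tapped-delay-line setting D for each column node.

    pipelined[0] is the node nearest the column driver and pipelined[-1] the
    most distant. Walking up from the far end, D starts at min_delay and is
    incremented once each pipelined node has been passed.
    """
    delays = [0] * len(pipelined)
    d = min_delay
    for i in reversed(range(len(pipelined))):
        delays[i] = d
        if pipelined[i]:
            d += 1  # nodes nearer the driver compensate for this node's register
    return delays


def one_way_latency(pipelined: Sequence[bool], delays: Sequence[int]) -> List[int]:
    """Clock cycles from the top of the column to each horizontal segment."""
    latencies = []
    registers = 0
    for is_pipelined, d in zip(pipelined, delays):
        if is_pipelined:
            registers += 1  # the tap is after registers 23/24 in a node 22
        latencies.append(registers + d)
    return latencies
```

For a FIG. 7 style column of eight nodes with pipelined nodes at positions 0 and 4, `allocate_delays` gives [1, 1, 1, 1, 0, 0, 0, 0] and every node sees the same one-way figure of 2 cycles; passing `min_delay=1`, as suggested to reduce loading on the registers 23, shifts every setting and the common latency up by one.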

The addressing of individual array elements is encoded in the signals transferred over this bus structure as row, vertical bus column and sub-group column. The column bus driver 1 can decode the vertical bus column information to selectively enable the columns or, if a broadcast type address is used, it can enable all of the columns. The row nodes 3 decode the row information and sub-group column information; hence they must be configured with this information, derived from their placement. The column nodes 2 do not actively decode row information in this illustrated embodiment of the invention, since the power saving is not worth the complexity overhead at this granularity. However, in other embodiments, the column nodes could decode this information in the same way that the column driver and the row nodes do, by snooping the bus protocol.

An array element is addressed if the bus activity reaches it and all the address aspects match. If single addressing is used, the destination array element decodes the communication if the row address and the sub-group column address match its own. If a broadcast type address is used, in order to communicate with more than one array element, then the row nodes have to discriminate based on some other identification parameter, such as array element type. Broadcast addressing can be flagged either by a separate control wire or by using “reserved” addresses, whichever is more efficient.
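A row node's match decision can be sketched in Python as follows. This assumes the example broadcast encoding given for the tokenised protocol later in this description (bit 15 selects broadcast, bits [7:0] select element types 1 to 8). Because the layout of the row and sub-group column fields within an individual address is not specified, the sketch compares the whole individual address against a hypothetical configured value `own_aeid`; it also assumes the column driver has already let the bus activity reach this node. `RowNodeAddress` is a hypothetical name.

```python
from dataclasses import dataclass

BROADCAST_BIT = 1 << 15


@dataclass(frozen=True)
class RowNodeAddress:
    own_aeid: int      # hypothetical: this element's individual address (< 0x8000)
    element_type: int  # 1..8, hard-wired into the row node's configuration

    def is_addressed(self, aeid: int) -> bool:
        if aeid & BROADCAST_BIT:
            # Broadcast: discriminate on element type via bits [7:0].
            return bool(aeid & (1 << (self.element_type - 1)))
        # Single addressing: stands in for matching row and sub-group column.
        return aeid == self.own_aeid
```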

Control of array elements, such as the synchronous starting, stopping or single-stepping, is achieved by writing specific data into control register locations within the array elements. To address these together in a broadcast communication, these control locations must therefore be at the same place in each array element's memory map. It is useful to be able to issue a single-step control command, instructing the array element to start for one step and then stop, because the addressing token overhead in the communications protocol prevents separate start and stop commands from being issued close enough together to achieve this.

It can also be advantageous, in order to avoid problems caused by large clock skews, for example register setup or hold violations, to place buffers (to speed up or to delay signals) at certain points in the nodes. For example, in the case of a column node as shown in FIG. 2 or 3, delay buffers may be inserted to prevent hold violations in buses 20 and 25, in the vertical bus 8 before and after the tap point 26, and after the OR gate 17. In the case of a row node as shown in FIG. 4, delay buffers may be inserted to prevent hold violations in bus 59.

There is therefore provided an arrangement which achieves an approximately constant latency. Communications to and from the farthest array elements are suitably pipelined for the distance, while communications to and from closer array elements are deliberately “over-pipelined” such that the latency to all end-point elements is the same number of clock cycles. This allows a high bandwidth to be achieved, and is scalable without redesign.

The communication itself takes the form of a tokenised stream, and the processor array is seen as a hierarchical memory map, that is, a memory map of array elements, each of which has its own memory map of program, data and control locations. The tokens are used to flag array element address, sub-address and read/write data. There are special reserved addresses for addressing all array elements (or subsets) in parallel for control functions.

A tokenised communications protocol, which may be used in conjunction with this processor array, is described in more detail below.

The outgoing bus is a 20-bit bus comprising 4 active-high flags and a 16-bit data field:

Bit Range   Description (Outgoing Bus)
[31:20]     Reserved.
[19]        AEID flag. Used to indicate that the payload data is an Array Element “ID” or address.
[18]        ADDR flag. Used to indicate that the payload data is a register or memory address within an Array Element.
[17]        READ flag. Used to indicate that a read access has been requested. The payload data will be ignored.
[16]        WRITE flag. Used to indicate that a write access has been requested. The payload data is the data to be written.
[15:0]      Payload data: Array Element address, register/memory address, or data to be written.
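The bit assignments listed above can be captured in a small packing helper. This is a minimal Python sketch; the constant and function names are hypothetical, and it assumes the reserved bits [31:20] are simply driven as zero.

```python
from typing import Tuple

AEID_FLAG = 1 << 19
ADDR_FLAG = 1 << 18
READ_FLAG = 1 << 17
WRITE_FLAG = 1 << 16
FLAG_MASK = AEID_FLAG | ADDR_FLAG | READ_FLAG | WRITE_FLAG
PAYLOAD_MASK = 0xFFFF


def encode_outgoing(flags: int, payload: int = 0) -> int:
    """Pack flag bits [19:16] and a 16-bit payload [15:0]; bits [31:20] stay 0."""
    if flags & ~FLAG_MASK:
        raise ValueError("only bits [19:16] may carry flags")
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload must fit in 16 bits")
    return flags | payload


def decode_outgoing(word: int) -> Tuple[int, int]:
    """Split an outgoing word back into (flag bits, payload)."""
    return word & FLAG_MASK, word & PAYLOAD_MASK
```

For example, `encode_outgoing(ADDR_FLAG | WRITE_FLAG, 0xBEEF)` packs the combined ADDR, WRITE token used later for non-incrementing writes.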

The return path is a 17-bit bus comprising an active-high valid flag and a 16-bit data field:

Bit Range   Description (Return Path Bus)
[16]        VALID flag. Indicates that the read access addressed an Array Element that exists.
[15:0]      Payload data: data read back from a register or memory location.

The VALID flag is needed where the address space of Array Elements is not fully populated. Otherwise it may be difficult to differentiate between a failed address and data that happens to be zero.
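A sketch of how a receiver, such as the array control processor reading back from its FIFO, might interpret a 17-bit return word so that a failed address is not confused with genuine zero data. The function name is hypothetical.

```python
from typing import Optional

VALID_FLAG = 1 << 16
DATA_MASK = 0xFFFF


def decode_return(word: int) -> Optional[int]:
    """Return the 16-bit read data, or None if VALID is clear (no such element)."""
    if word & VALID_FLAG:
        return word & DATA_MASK
    return None
```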

Basic Write Operation: the sequence of commands to send over the outgoing bus is as follows:

AEID, <array element address>

ADDR, <register/memory location>

WRITE, <data word>

The user could write to multiple locations, one after another, by repeating the above sequence as many times as necessary:

AEID, <array element address 1>

ADDR, <register/memory location in array element 1>

WRITE, <data word>

AEID, <array element address 2>

ADDR, <register/memory location in array element 2>

WRITE, <data word>

etc.

If the AEID is going to be the same, there is no need to repeat it:

AEID, <array element address 1>

ADDR, <register/memory location 1>

WRITE, <data word for location 1 in array element 1>

ADDR, <register/memory location 2>

WRITE, <data word for location 2 in array element 1>

etc.
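The write sequences above, including the option of omitting a repeated AEID, can be generated mechanically. In this Python sketch, tokens are represented as hypothetical (flag names, payload) pairs rather than packed bus words, the function name is hypothetical, and auto-incrementing is deliberately not used.

```python
from typing import Iterable, List, Optional, Tuple

Token = Tuple[Tuple[str, ...], int]


def write_stream(writes: Iterable[Tuple[int, int, int]]) -> List[Token]:
    """writes: (aeid, location, data) triples. Emits AEID only when it changes."""
    tokens: List[Token] = []
    current: Optional[int] = None
    for aeid, location, data in writes:
        if aeid != current:
            tokens.append((("AEID",), aeid))
            current = aeid
        tokens.append((("ADDR",), location))
        tokens.append((("WRITE",), data))
    return tokens
```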

In each case, the data will be written into the Array Element location so long as the Array Element exists and the register or memory location exists and is writeable (some locations may be read-only, and some may be writeable only when the Array Element is stopped, not when it is running).

Auto-incrementing Write Operation: to save time when writing to multiple successive contiguous register or memory locations within a single Array Element (as one might often do when loading an Array Element's program, for example), use repeated WRITE commands. The interface in the row node will increment the address used inside the Array Element automatically. For example:

AEID, <array element address>

ADDR, <starting register or memory location—“A”>

WRITE, <data for location A>

WRITE, <data for location A+1>

WRITE, <data for location A+2>

WRITE, <data for location A+3>

etc.

Where there are gaps in the memory map, or where it is required to move to another Array Element, use the ADDR or AEID flag again to set up a new starting point for the auto-increment, e.g.:

AEID, <array element address>

ADDR, <starting register or memory location—“A”>

WRITE, <data for location A>

WRITE, <data for location A+1>

WRITE, <data for location A+2>

ADDR, <new starting register or memory location—“B”>

WRITE, <data for location B>

WRITE, <data for location B+1>

AEID, <new array element address>

ADDR, <register or memory location>

WRITE, <data word>

etc.
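For program loading, a stream generator only needs to emit a fresh ADDR token where the memory image has a gap, relying on auto-increment everywhere else. A minimal Python sketch, assuming the image for one Array Element is supplied as a hypothetical mapping from location to data word, and representing each token as a (flag names, payload) pair for illustration; the function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[Tuple[str, ...], int]


def load_stream(aeid: int, image: Dict[int, int]) -> List[Token]:
    """Token stream loading one Array Element, using auto-increment where possible."""
    tokens: List[Token] = [(("AEID",), aeid)]
    expected: Optional[int] = None  # location the auto-increment will reach next
    for location in sorted(image):
        if location != expected:
            tokens.append((("ADDR",), location))  # new starting point after a gap
        tokens.append((("WRITE",), image[location]))
        expected = location + 1
    return tokens
```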

Non-incrementing Write Operation: where it is required to defeat the automatic incrementing of the register or memory location address, keep the ADDR flag together with the WRITE flag:

AEID, <array element address>

ADDR, <register location—“A”>

ADDR, WRITE, <data for location A>

ADDR, WRITE, <new data for location A>

It should be noted that there could be a long period of bus inactivity between commands 3 and 4, during which the processor array continues to run. In fact, there is no need for any of these bus operations to be in a contiguous burst; there can be gaps of any length at any point. The protocol works like a state machine without any kind of timeout.
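Because the protocol behaves as a state machine with no timeout, the decoding performed by an array element interface 57 can be sketched as an object whose state simply persists between tokens. The Python class below is an illustration only: its name, the flat list used as the element's memory, the decision not to reset the address on a new AEID, the lack of broadcast support, and the silent handling of out-of-range locations are all assumptions rather than details taken from this description.

```python
from typing import Iterable, List


class RowNodeInterface:
    """Hypothetical model of an array element interface decoding tokens."""

    def __init__(self, own_aeid: int, memory_size: int) -> None:
        self.own_aeid = own_aeid
        self.memory = [0] * memory_size
        self.selected = False
        self.pointer = 0
        self.readback: List[int] = []  # words this node would OR onto the return path

    def token(self, flags: Iterable[str], payload: int) -> None:
        flags = set(flags)
        if "AEID" in flags:
            self.selected = payload == self.own_aeid
            return
        if not self.selected:
            return
        if not flags & {"READ", "WRITE"}:
            if "ADDR" in flags:
                self.pointer = payload  # new starting point for auto-increment
            return
        if 0 <= self.pointer < len(self.memory):
            if "WRITE" in flags:
                self.memory[self.pointer] = payload
            else:
                self.readback.append(self.memory[self.pointer])
        if "ADDR" not in flags:
            self.pointer += 1  # auto-increment unless ADDR accompanies the access
```

Feeding it the composite memory-test stream given at the end of this section exercises both the incrementing and the non-incrementing paths.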

Broadcast Write Operation: it is possible to write to all Array Elements at once, or to subsets of Array Elements by group. This broadcast addressing could be indicated by an extra control signal, or be achieved by using special numbers for the AEID address.

In the example implementation, used for the processor array described in GB-A-2370380, the whole array could be addressed on an individual element basis well within 15 bits, so the top bit of the 16-bit AEID address could be reserved for indicating that a broadcast type communication was in progress.

To select Broadcast rather than single Array Element addressing, set the MSB of the AEID data field. The lower bits can then represent which Array Element types you wish to address. In our example processor array, there are 8 array element types, and their designations are hard-wired into the configuration of their row nodes:

Bits      Description
[15]      Broadcast Addressing Mode Select
[14:8]    Reserved
[7]       Type 8
[6]       Type 7
[5]       Type 6
[4]       Type 5
[3]       Type 4
[2]       Type 3
[1]       Type 2
[0]       Type 1

So, for example, to address all Type 7 Array Elements:

AEID, <0x8040>

ADDR, < . . . >

etc.

To address all Type 1, Type 2 and Type 4 Array Elements together:

AEID, <0x800b>

ADDR, < . . . >

etc.
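The broadcast AEID values in these examples follow directly from the bit table, and can be computed as follows (a Python sketch; the function name is hypothetical):

```python
from typing import Iterable

BROADCAST = 0x8000  # bit 15: Broadcast Addressing Mode Select


def broadcast_aeid(types: Iterable[int]) -> int:
    """AEID payload addressing every Array Element of the given types (1..8)."""
    value = BROADCAST
    for t in types:
        if not 1 <= t <= 8:
            raise ValueError(f"type {t} is outside 1..8")
        value |= 1 << (t - 1)  # Type n lives in bit n-1
    return value
```

With this helper, `broadcast_aeid([7])` reproduces 0x8040 and `broadcast_aeid([1, 2, 4])` reproduces 0x800b.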

Basic Read Request Operation: the basic read operation is very similar to the basic write operation, the differences being the last flag, and that the data field is ignored:

AEID, <array element address>

ADDR, <register or memory location>

READ, <don't care>

The location will be read from successfully so long as the Array Element exists and the register or memory location exists and is readable (some locations may only be readable if the Array Element is stopped and not when it is running). The data word read from the Array Element will be sent back up the return path bus, in this example to be stored for later retrieval in a FIFO.

Auto-incrementing Read Operation: again, very similar to the corresponding write operation:

AEID, <array element address>

ADDR, <starting register or memory location—“A”>

READ, <don't care> (data will be fetched from location A)

READ, <don't care> (data will be fetched from location A+1)

etc.

Non-incrementing Read Operation: where it is required to defeat the automatic incrementing of the register or memory location address, keep the ADDR flag together with the READ flag:

AEID, <array element address>

ADDR, <register location—“A”>

ADDR, READ, <don't care> (data will be fetched from location A)

ADDR, READ, <don't care> (data will be fetched from location A)

This could be useful if you want to poll a register for diagnostic information, for example a bit error rate metric.

Broadcast Read Operation: the hardware in our example processor array does not preclude performing a broadcast read, though its usefulness is rather limited, because readback data from multiple Array Elements will be bitwise ORed together. It may be useful for quickly checking whether the same register in multiple Array Elements is non-zero before going through each one individually to find out which ones specifically.

Composite Operations: as seen above, the tokenised style of the bus allows for many permutations of commands of arbitrary length, and often allows short-cuts in command overhead to be taken. For example, it may be useful to generate a stream to perform part of a memory test, reading and writing each successive location of a memory:

AEID, <array element address>

ADDR, <starting memory location—“A”>

ADDR, READ, <don't care> (data will be fetched from location A, the address WILL NOT be incremented)

WRITE, <data word> (data word will be written to location A, the address WILL be incremented)

ADDR, READ, <don't care> (data will be fetched from location A+1, the address WILL NOT be incremented)

WRITE, <another data word> (another data word will be written to location A+1, the address WILL be incremented)

etc.
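The composite memory-test stream above can be generated for any number of locations. A minimal Python sketch, with a hypothetical function name, representing each token as a (flag names, payload) pair and using 0 for the don't-care READ payload:

```python
from typing import List, Sequence, Tuple

Token = Tuple[Tuple[str, ...], int]


def memory_test_stream(aeid: int, start: int, patterns: Sequence[int]) -> List[Token]:
    """Read, then overwrite, each successive location beginning at start."""
    tokens: List[Token] = [(("AEID",), aeid), (("ADDR",), start)]
    for word in patterns:
        tokens.append((("ADDR", "READ"), 0))  # read current location, no increment
        tokens.append((("WRITE",), word))     # write the same location, then increment
    return tokens
```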

There are therefore described a processor array, and a communications protocol for use therein, which allow efficient synchronised operation of the elements of the array.

1. A processor array, comprising: a plurality of primary buses, each connected to a same primary bus driver, and each primary bus having a respective plurality of primary bus nodes thereon; respective pluralities of secondary buses, each secondary bus connected to a respective one of said primary bus nodes; a plurality of processor elements, each connected to one of the secondary buses; and delay elements, implemented in the primary bus nodes, for delaying communications with processor elements connected to different ones of the secondary buses by different amounts, in order to achieve a degree of synchronization between operation of said processor elements, wherein each of said primary and secondary buses is a bidirectional bus, for transferring data from the primary bus driver to the processor elements, and for transferring data from the processor elements to the primary bus driver.
2. A processor array as claimed in claim 1, wherein each primary bus node comprises a tap for tapping off a signal from the primary bus driver on the respective primary bus, and a delay line for delaying the tapped off signals.
3. A processor array as claimed in claim 2, wherein at least some of said primary bus nodes comprise a delay element for delaying the signals from the primary bus driver on the respective primary bus.
4. A processor array as claimed in claim 1, wherein each primary bus node comprises a device for combining a signal from the respective secondary bus onto the respective primary bus, and a delay line for delaying the signals from the respective secondary bus.
5. A processor array as claimed in claim 4, wherein the device for combining comprises a bitwise logical OR gate.
6. A processor array as claimed in claim 4, wherein at least some of said primary bus nodes comprise a delay element for delaying the signals towards the primary bus driver on the respective primary bus.
7. A processor array as claimed in claim 1, wherein each processor element is connected to the respective secondary bus at a respective secondary bus node.
8. A processor array as claimed in claim 1, wherein each processor element is connected to the respective one of the secondary buses at a respective secondary bus node, and wherein each secondary bus node comprises a tap for tapping off a signal from the primary bus driver on the respective secondary bus, and an interface for determining whether the tapped off signals are intended for the processor element connected thereto.
9. A processor array as claimed in claim 1, wherein each processor element is connected to the respective one of the secondary buses at a respective secondary bus node, and wherein each secondary bus node comprises a device for combining a signal from the respective processor element onto the respective secondary bus.
10. A processor array as claimed in claim 9, wherein the device for combining comprises a bitwise logical OR gate.
11. A processor array as claimed in claim 1, wherein the primary bus driver has an input bus, and a detector for determining which of said plurality of primary buses should receive data on said input bus.
12. A processor array as claimed in claim 11, wherein the input bus of the primary bus driver has a connection to each of said plurality of primary buses through a first input of a respective AND gate, and said detector is adapted to send an enable signal to a second input of the respective AND gate if it is determined that one of said plurality of primary buses should receive data on said input bus.
13. A processor array as claimed in claim 1, wherein the delay elements, implemented in primary bus nodes which are physically nearer to the primary bus driver, delay communications with processor elements connected to the secondary buses connected to those nearer primary bus nodes, by longer delay times than the delay elements, implemented in primary bus nodes which are physically further from the primary bus driver, delay communications with processor elements connected to the secondary buses connected to those further primary bus nodes.
14. The processor array as claimed in claim 1, wherein the degree of synchronization between operations of said processor elements is achieved by substantially equalizing a time taken for signals to pass from the primary bus driver to each of the processor elements.
15. The processor array as claimed in claim 1, wherein each primary bus is independently driven by said primary bus driver.
16. The processor array as claimed in claim 15, wherein said respective plurality of primary bus nodes are serially connected to each primary bus.
17. The processor array as claimed in claim 1, wherein each of said primary and secondary buses comprises a first bus element configured to transfer data in one direction and a second bus element configured to transfer data in another direction.